

### fpgaConvNet: A Toolflow for Mapping Convolutional Neural Networks on Embedded FPGAs

Dr. Christos-Savvas Bouganis

Marionet UK Many-core Research Network, 11th of September, Bristol University, UK



www.imperial.ac.uk/idsl

The team

### Intelligent Digital Systems Lab

Aug 4, 2018



- Stylianos I. Venieris: Machine Learning
- Manolis Vasileiadis: Computer Vision
- Mudhar Bin Rabieah: Machine Learning
- Nur Ahmadi: Brain-Machine Interface
- Konstantinos Boikos: Computer Vision
- Christos-Savvas Bouganis: iDSL Lab Director, Imperial College London


### DNNs in the Embedded Space – Variability in Performance Requirements



High-Throughput Applications

Multiobjective Applications

Low-Latency Applications




**Power constraints** 

- Absolute power consumption
- Performance-per-Watt


#### **Conventional and Unconventional Embedded Platforms for Neural Networks**



### **Research Areas / Challenges**


### Mapping Automation

### Multiple CNN Mapping



# Challenge #1: Mapping Automation




### Challenge #1: Automated CNN-to-FPGA Toolflow




### fpgaConvNet – CNN Modelling Framework

### **Key Characteristics**

- Differentiation factors:
  - Streaming architecture
  - Hardware design tailored to the target CNN
  - No limit on #weights, or size of CNN
- Synchronous Dataflow Modelling for CNNs
  - CNN as a data-driven graph
  - Workload is represented as a matrix
  - Each layer mapped to a tunable set of hardware building blocks
- Design space exploration based on **transformations** 
  - Coarse-grained folding
  - Fine-grained folding
  - Graph partitioning with reconfiguration
  - Weight Reloading
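The synchronous dataflow view listed above can be illustrated with a small sketch. In standard SDF theory, every edge carries a fixed number of tokens produced and consumed per firing, and a consistent graph admits a repetitions vector that balances production and consumption on every edge. The function name, the three-node chain and its rates below are hypothetical and are not taken from fpgaConvNet.

```python
from fractions import Fraction
from math import lcm


def repetitions_vector(num_nodes, edges):
    """Solve the SDF balance equations for a connected graph.

    edges: list of (producer, consumer, produced_per_firing, consumed_per_firing).
    Returns the smallest positive integer firing counts q such that, on every
    edge, q[producer] * produced == q[consumer] * consumed.
    """
    q = [None] * num_nodes
    q[0] = Fraction(1)
    changed = True
    while changed:  # propagate firing ratios along edges until nothing changes
        changed = False
        for src, dst, prod, cons in edges:
            if q[src] is not None and q[dst] is None:
                q[dst] = q[src] * prod / cons
                changed = True
            elif q[dst] is not None and q[src] is None:
                q[src] = q[dst] * cons / prod
                changed = True
    if any(x is None for x in q):
        raise ValueError("graph is not connected")
    for src, dst, prod, cons in edges:
        if q[src] * prod != q[dst] * cons:
            raise ValueError("inconsistent rates: no periodic schedule exists")
    scale = lcm(*(x.denominator for x in q))  # clear the fractions
    return [int(x * scale) for x in q]


# Hypothetical chain: node 0 emits 9 tokens per firing, node 1 consumes 9
# and emits 1, node 2 consumes 4.
print(repetitions_vector(3, [(0, 1, 9, 9), (1, 2, 1, 4)]))  # [4, 4, 1]
```

Because the rates are fixed, the firing counts can be found before anything runs, which is what makes static scheduling possible.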

Analytical performance model, used to maximise throughput or minimise latency:

$$t_{total}(B, N_P, \mathbf{\Gamma}) = \sum_{i=1}^{N_P} t_i(B, \mathbf{\Gamma}_i) + (N_P - 1) \cdot t_{reconfig.}$$

### **Under the hood: Convolutional Neural Networks (ConvNets)**



- ConvNet Inference
  - Tailored to images and data with spatial patterns
  - Built as a sequence of layers (Convolutional, Nonlinearity and Pooling Layer)




### fpgaConvNet – Streaming Architecture for CNNs



[Figure: CNN hardware SDF graph built from Sliding Window, Fork, Conv, Nonlin and Pool units; plot of top-1 accuracy [%] against Operations [G-Ops] for AlexNet, BN-AlexNet, BN-NIN, ENet, GoogLeNet, ResNet-18, ResNet-34, ResNet-152, VGG-16, VGG-19, Inception-v3 and Inception-v4; design-space plot of Throughput against Resources for FPGA 1 and FPGA 2 with the current design point marked]

Complex Model → Bottlenecks:

- Limited *compute resources*
- Limited *on-chip memory capacity* for model parameters
- Limited *off-chip memory bandwidth*

Define a set of graph transformations to traverse the design space in a **fast** and **principled** way


### **Transformation 1: Coarse-grained Folding**






Two conditions lead to FPGA reconfiguration:

1. Exceeding the available compute resources
2. Not enough on-chip memory capacity


### **Transformation 4: Weights Reloading**



Run-time vs bitstream-level reconfiguration to explore the latency-throughput trade-off
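The trade-off can be pictured with a rough sketch that compares two ways of switching between subgraphs. All names and numbers below are hypothetical. In particular, the assumptions that a bitstream switch costs far more than a run-time weights reload and that a bitstream tailored to each subgraph processes an input faster are illustrative and not measured.

```python
def batch_time(batch_size, per_input_time, num_switches, switch_time):
    """Time to push one batch through all subgraphs, paying switch_time per switch."""
    return batch_size * per_input_time + num_switches * switch_time


def faster_option(batch_size, options):
    """options maps a name to (per_input_time, num_switches, switch_time)."""
    return min(options, key=lambda name: batch_time(batch_size, *options[name]))


# Hypothetical figures, in seconds.
options = {
    "bitstream_reconfiguration": (0.004, 2, 0.08),
    "weights_reloading": (0.006, 2, 0.001),
}

print(faster_option(1, options))     # weights_reloading
print(faster_option(1000, options))  # bitstream_reconfiguration
```

Under these assumed figures, the cheap switch wins when a single input must be served quickly, while the expensive switch is amortised once the batch is large.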




### fpgaConvNet – Design Space Exploration and Optimisation

- SDF-based Framework
  - Capture hardware mappings as matrices
  - Transformations as *algebraic operations*
  - Any local transformation *propagates* through the network
  - Static Scheduling
  - Analytical performance model
  - Cast design space exploration as a multiobjective optimization problem

 $t_{total}(B, N_P, \mathbf{\Gamma}) = \sum_{i=1}^{N_P} t_i(B, \mathbf{\Gamma}_i) + (N_P - 1) \cdot t_{reconfig.}$ 
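As a minimal sketch, the expression above can be evaluated directly once each partition's execution time is available as a function of the batch size. The function name and the linear per-partition models in the usage lines are assumptions made for illustration; the real $t_i$ depends on the hardware mapping $\mathbf{\Gamma}_i$.

```python
from typing import Callable, Sequence


def total_time(
    batch_size: int,
    partition_times: Sequence[Callable[[int], float]],
    t_reconfig: float,
) -> float:
    """t_total = sum_i t_i(B) + (N_P - 1) * t_reconfig."""
    n_p = len(partition_times)
    if n_p == 0:
        raise ValueError("at least one partition is required")
    compute = sum(t_i(batch_size) for t_i in partition_times)
    return compute + (n_p - 1) * t_reconfig


# Hypothetical linear models: seconds per input for each of two partitions.
partitions = [lambda b: 0.002 * b, lambda b: 0.003 * b]

print(total_time(10, partitions, t_reconfig=0.1))
```

With a single partition the reconfiguration term vanishes, and with several partitions its cost is shared by every input in the batch.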




### Meeting the performance requirements




### **Comparison with Embedded GPUs: Same absolute power constraints (5W)**






### **Comparison with Embedded GPUs: Performance-per-Watt**



#### fpgaConvNet vs Embedded GPU (GOp/s/W)


### **Results: Comparison with existing FPGA frameworks**



**Other approaches** 




S. I. Venieris, A. Kouris and C-S Bouganis, "Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions", ACM Computing Surveys, 2018

## Challenge #2: Multi-CNN Systems




### Challenge #2: Multi-CNN Systems – Autonomous Drones




**The Problem setting and Challenges** 

Given a number of CNNs:

 $CNN_1$ ,  $CNN_2$ , ...,  $CNN_N$ 

find a mapping to an FPGA device that meets user requirements such as Latency and Throughput per CNN

(Extra) Challenges:

- Resource allocation per CNN
- Memory Bandwidth allocation per CNN
- Scalability

#### Challenge #2: Multi-DNN System

#### **Multi-CNN Hardware Architecture**

#### Key characteristics

- One highly customisable hardware engine per CNN
- Hardware scheduler to control memory access schedule




| Parameter                           | Symbol       |
|-------------------------------------|--------------|
| Pipeline structure                  | $\Gamma_i$   |
| No. of PEs in each stage            | $N_{PE,i,j}$ |
| No. of MAC operators within each PE | $N_{op,i,j}$ |
| Schedule                            | $S$          |
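One way to picture a point in this design space is as a small record holding the four parameters in the table. The sketch below is only illustrative: the class and field names are hypothetical, the pipeline structure is simplified to an ordered list of stage names, and the schedule is simplified to an order in which engines access memory.

```python
from dataclasses import dataclass


@dataclass
class StageConfig:
    num_pes: int       # N_PE,i,j: processing elements in stage j of engine i
    macs_per_pe: int   # N_op,i,j: MAC operators inside each PE


@dataclass
class EngineConfig:
    pipeline: list[str]        # Gamma_i, simplified to an ordered list of stage names
    stages: list[StageConfig]  # one entry per pipeline stage

    def __post_init__(self):
        if len(self.pipeline) != len(self.stages):
            raise ValueError("need exactly one StageConfig per pipeline stage")

    def total_macs(self) -> int:
        """Total MAC operators instantiated by this engine."""
        return sum(s.num_pes * s.macs_per_pe for s in self.stages)


@dataclass
class MultiCnnDesign:
    engines: list[EngineConfig]  # one hardware engine per CNN
    schedule: list[int]          # S, simplified to an engine access order


design = MultiCnnDesign(
    engines=[
        EngineConfig(["conv1", "conv2"], [StageConfig(4, 8), StageConfig(2, 16)]),
        EngineConfig(["conv1"], [StageConfig(1, 32)]),
    ],
    schedule=[0, 1],
)
print([e.total_macs() for e in design.engines])  # [64, 32]
```

Keeping the per-engine parameters separate from the shared schedule mirrors the split in the table between engine customisation and memory access control.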

### **Proposed Design Space Exploration Method**

[Figure: proposed flow, with blocks labelled Target set of CNNs, Target Platform, CNN Hardware SDF Model, Individual DSE, Individual Pareto Curves, Joint Feasible Space, Scheduler, Multi-CNN Hardware Mapping, Hw/Sw Templates, Code Generator and HLS Files; inset plot of Execution Time (s) against Area (%) showing the valid design space explored]



An example




[Figure: two example networks. CNN1 is built from CONV (7x…, 5x5, 5x5), ReLU and MAX POOL layers; CNN2 is built from CONV (3x…, 3x3), ReLU and MAX POOL layers]

For each CNN

- A set of subgraphs
- Bandwidth requirements

Possible memory contention


### **Proposed Design Space Exploration Method**



- Memory contention
  - Problem 1: Performance model ≠ actual performance (scheduler)
  - Problem 2: Memory bandwidth is not fully utilised
- CNN inference over a stream of inputs
  - Cast to a cyclic scheduling problem
  - Search for a periodic solution
- An optimal ILP scheduler has very long runtimes for large problems
- We propose a heuristic Resource Constrained List Scheduler (RCLS).
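To give a flavour of resource-constrained list scheduling, the sketch below greedily starts ready subgraphs whenever the shared memory bandwidth allows. It is a simplified, non-cyclic illustration and not the RCLS proposed in the talk: the function name, the priority rule (highest bandwidth demand first) and the task figures are all hypothetical.

```python
import heapq


def list_schedule(chains, capacity):
    """Greedy resource-constrained list scheduling.

    chains: one list per CNN of (duration, bandwidth) subgraph tasks that must
            run in order within that CNN.
    capacity: total memory bandwidth that running tasks may use together.
    Returns {(cnn, task_index): start_time}.
    """
    for chain in chains:
        for _, bw in chain:
            if bw > capacity:
                raise ValueError("a task needs more bandwidth than is available")
    next_idx = [0] * len(chains)   # next unscheduled task per CNN
    busy = [False] * len(chains)   # whether a CNN has a task running
    running = []                   # heap of (finish_time, cnn, bandwidth)
    now = 0.0
    starts = {}
    while True:
        used = sum(bw for _, _, bw in running)
        ready = [c for c in range(len(chains))
                 if not busy[c] and next_idx[c] < len(chains[c])]
        # Hypothetical priority rule: highest bandwidth demand first.
        ready.sort(key=lambda c: -chains[c][next_idx[c]][1])
        for c in ready:
            duration, bw = chains[c][next_idx[c]]
            if used + bw <= capacity:
                starts[(c, next_idx[c])] = now
                heapq.heappush(running, (now + duration, c, bw))
                used += bw
                busy[c] = True
                next_idx[c] += 1
        if not running:
            break
        now, c, _ = heapq.heappop(running)  # advance to the next completion
        busy[c] = False
    return starts


# Two hypothetical CNNs sharing 2.0 units of bandwidth.
chains = [[(1.0, 1.5), (1.0, 0.5)], [(2.0, 1.0)]]
print(list_schedule(chains, capacity=2.0))
```

In this example the second CNN has to wait until time 1.0, because its first subgraph does not fit alongside the 1.5-unit task of the first CNN.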

### **Slow-down Scheduler**




- Increase the latency and decrease the bandwidth proportionally
- One slow-down factor per subgraph

$$L'(s_{i,j}) = \frac{1}{sl_{i,j}} \times L(s_{i,j}) \quad \text{(latency increase)}$$
$$b'(s_{i,j}) = sl_{i,j} \times b(s_{i,j}) \quad \text{(bandwidth decrease)}$$
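The two equations can be checked with a short worked example. The sketch assumes that the slow-down factor lies in (0, 1], which is what makes the latency grow and the bandwidth shrink; the function name and the numbers are hypothetical.

```python
def apply_slow_down(latency, bandwidth, sl):
    """Return (L', b') with L' = L / sl and b' = sl * b, assuming 0 < sl <= 1."""
    if not 0 < sl <= 1:
        raise ValueError("slow-down factor is assumed to lie in (0, 1]")
    return latency / sl, sl * bandwidth


# Hypothetical subgraph: 10 ms latency, 1.6 GB/s bandwidth, slowed by sl = 0.5.
print(apply_slow_down(10.0, 1.6, 0.5))  # (20.0, 0.8)
```

The product of latency and bandwidth is unchanged by the slow-down, so the subgraph moves the same amount of data over a longer interval.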


### The effect of slow-downs

#### Scheduler

#### Scheduler + slow downs



Available Memory Bandwidth: 2 GB/s

### **Effect of the Proposed DSE**

- 3-CNN benchmark on ZC706
- Explored joint design points appear in triplets:
  - Blue → peak platform-supported performance per CNN
  - Red → contention-unaware design
  - Yellow → memory-aware design




#### Full platform available bandwidth for each CNN engine




**Comparison with Embedded GPUs** 

 Up to 19.09× speedup with an average of 6.85× (geo. mean)



**Conclusions** 

- Performance (efficiency) comes from customisation
- ML applications:
  - Fast moving area => new computational blocks appear frequently
  - Diverse application areas (ADAS, drones, video analytics)
- To improve hardware efficiency
  => highly customisable architecture
  => large design space
- Need for Tools

### Summary

### **Research topics**





A. Kouris and C-S Bouganis, "Learning to Fly by MySelf: A Self-Supervised CNN-based Approach for Autonomous Navigation", IROS, 2018

www.imperial.ac.uk/idsl

### Publications

- Alexandros Kouris, Stylianos I. Venieris, and Christos-Savvas Bouganis. 2018. CascadeCNN: Pushing the performance limits of quantisation. In SysML.
- Alexandros Kouris, Stylianos I. Venieris, and Christos-Savvas Bouganis. 2018. CascadeCNN: Pushing the Performance Limits of Quantisation in Convolutional Neural Networks. In 2018 28th International Conference on Field Programmable Logic and Applications (FPL).
- C. Kyrkou, G. Plastiras, T. Theocharides, S. I. Venieris, and C. S. Bouganis. 2018. *DroNet: Efficient Convolutional Neural Network Detector for Real-Time UAV Applications.* In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). 967–972.
- Michalis Rizakis, Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Approximate FPGA-based LSTMs under Computation Time Constraints. In Applied Reconfigurable Computing - 14th International Symposium, ARC 2018, Santorini, Greece, May 2 - 4, 2018, 3–15.
- Stylianos I. Venieris and Christos-Savvas Bouganis. 2016. *fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs.* In 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 40–47.
- Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. *fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs.* In NIPS 2017 Workshop on Machine Learning on the Phone and other Consumer Devices.
- Stylianos I. Venieris and Christos-Savvas Bouganis. 2017. *fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs* (Abstract Only). In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 291–292.
- S. I. Venieris and C. S. Bouganis. 2017. *Latency-Driven Design for FPGA-based Convolutional Neural Networks*. In 2017 27th International Conference on Field Programmable Logic and Applications (FPL).
- S. I. Venieris and C. S. Bouganis. 2018. *f-CNNx: A Toolflow for Mapping Multiple Convolutional Neural Networks on FPGAs.* In 2018 28th International Conference on Field Programmable Logic and Applications (FPL).
- Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. 2018. Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions. In ACM Computing Surveys 51, 3, Article 56 (June 2018), 39 pages.